Transcription factors control which information in a genome becomes transcribed to produce RNAs that function in the 
biological systems of cells and organisms. Reliable and comprehensive information about transcription factors is invaluable 
for large-scale network-based studies. However, existing transcription factor knowledge bases are still lacking in well- 
documented functional information. 

Here, we provide guidelines for a curation strategy, which constitutes a robust framework for using the controlled voca- 
bularies defined by the Gene Ontology Consortium to annotate specific DNA binding transcription factors (DbTFs) based on 
experimental evidence reported in literature. Our standardized protocol and workflow for annotating specific DNA bind- 
ing RNA polymerase II transcription factors is designed to document high-quality and decisive evidence from valid experi- 
mental methods. Within a collaborative biocuration effort involving the user community, we are now in the process of 
exhaustively annotating the full repertoire of human, mouse and rat proteins that qualify as DbTFs inasmuch as they are 
experimentally documented in the biomedical literature today. The completion of this task will significantly enrich Gene 
Ontology-based information resources for the research community. 

Database URL: www.tfcheckpoint.org 



Introduction 

Specific gene regulation mechanisms determine which part 
of the genome becomes transcribed to provide the active 
molecular parts of living organisms in various environmen- 
tal conditions. Central in these mechanisms are multipro- 
tein complexes present at the regulatory regions of genes 
that determine the onset and rate of RNA synthesis by 
regulating RNA polymerase activity (1, 2). These multipro- 
tein complexes comprise general transcription factors, 
general co-factors (3), RNA polymerase II (RNAP II) se- 
quence-specific DNA binding transcription factors (DbTFs) 
(4) and a large array of transcriptional regulators that 
lack DNA-binding activity but exert their regulatory roles 
through protein interaction with the aforementioned proteins
(which include co-activators, co-repressors, histone
modifiers and chromatin remodeling proteins) (1, 2).
General transcription factors bind to core-promoter DNA 
where they constitute pre-initiation transcription complex- 
es, in concert with general co-factors, whereas DbTFs bind 
to gene-specific proximal and distal gene regulatory 
regions. RNAP II, one of the three nuclear RNA polymerases 
(RNAP I, II and III) involved in transcription of mammalian 
genes, draws special attention in studies directed at gene 
regulatory mechanisms, as it is responsible for transcribing 
protein-coding genes as well as miRNA and other RNA 
genes (5). 

Owing to their selective binding within regulatory 
regions of distinct genes, the DbTFs play decisive roles in 
directing the assembly of the multiprotein transcription 
machinery to a particular subset of genes. This assembly 
can either be followed by immediate RNAP II-dependent
transcription or it can result in promoter-proximal pausing 
of RNAP II that can subsequently be released into active 
transcription triggered by either DbTFs or other mechan- 
isms (1, 6, 7). DbTFs also play a central role in transcription 
repression either by competing with activating DbTFs for 
DNA binding or by recruiting transcriptional co-repressors 
(2, 8). Through these functions, DbTFs link the phenotypical 
state of the cell — reflected in abundance and activation 
state of proteins in the transcriptional machinery — to the 
decoding of regulatory information embedded within the 
genome sequence. Thus, the DbTFs are a point of conver- 
gence for mechanisms involved in upward causation, i.e. 
the flow of information from genome to phenome (central 
dogma), as well as in downward causation, which enables 
the organism to respond to cues from the extrinsic and 
intrinsic environment (9). 

Current estimates suggest that the human genome con- 
tains ~1900 DbTF-coding genes (10). With the increasing 
trend to pursue a systems-level understanding of gene 
regulatory networks (11), it is of key importance to have 
available genome-wide and accurate information concern- 
ing DbTFs including their specific roles in transcription regu- 
lation, their target genes (TGs) and their expression 
patterns related to cell type and to developmental as well 
as to normal- and pathophysiological processes. This need 
for genome-wide information has sparked (among others) 
the ENCODE project, an initiative to identify all functional 
elements in the human genome sequence and the regula- 
tory interactions between TFs and their transcription factor 
binding sites (TFBS) (12). Thus, experimental data will con- 
tinue to become available in ever-increasing volumes, and 
subsequent comprehensive annotation of functional 
aspects of DbTFs in public databases will be of high value 
for ongoing and future gene regulatory studies. 

The Gene Ontology (GO) provides a common vocabulary 
for the functional description of genes and gene products 
and consists of three sub-ontologies: Biological Process (BP), 
Molecular Function (MF) and Cellular Component (13). 



The Gene Ontology Consortium (GOC) provides high-qual- 
ity classifications for types of transcription factors and cap- 
tures the supporting evidence for the assignment of classes 
to gene products. Recently (2010-2011), the GOC under- 
took a major reorganization of the representation of tran- 
scription factors within GO to bring this area up-to-date 
with current knowledge, to incorporate some advances in 
the ontological representation allowed in GO and to make 
all of the transcription factor terms conform to the prin- 
ciple that terms in the MF aspect of GO should represent 
knowledge about the mechanism of action of that function,
e.g. 'DNA binding', 'RNA polymerase binding' or
'transcription factor binding'.

The reorganization of the transcription factor MF terms 
generated a more robust ontology structure by improving 
both textual definitions and relationships between terms in 
the ontology structure [(14); see also Supplementary 
Material 1 for additional comments on background and 
orientation for the reorganization]. 

For example, nucleic acid-binding transcription factors 
must have nucleic acid-binding activity to function and 
also must regulate transcription. Thus, the MF terms for 
types of 'nucleic acid binding transcription factor activity' 
are required to have 'has_part' relationships to the appro- 
priate MF terms for 'nucleic acid binding' [e.g. 'sequence- 
specific DNA binding RNA polymerase II transcription factor 
activity' (GO:0000981) has_part 'RNA polymerase II regula- 
tory region sequence-specific DNA binding' (GO:0000977)] 
(see Figure 1). Equally important, MF 'transcription factor 
activity' terms [e.g. 'sequence-specific DNA binding RNA 
polymerase II transcription factor activity' (GO:0000981)] 
are also required to have 'part_of' relationships to appropriate
BP terms for 'regulation of transcription' [e.g. 'regulation
of transcription from RNA polymerase II promoter'
(GO:0006357)], as the overall biological objective of the
function of the molecule is to take part in regulating transcription.
These 'part_of' relationships between a specific
MF term and a BP term represent a previous advance in 
the use of relationships within the GO structure to provide 
more contextually-dependent MF terms, e.g. when the 
same enzymatic activities are used in more than one pro- 
cess. In the course of revising the transcription section of 
GO, we incorporated these 'part_of' links from MF to BP
terms to provide more complete representation of the 
'transcription factor activity' terms, which are located 
within the MF aspect of GO. Examples of these 'has_part' 
and 'part_of' relationships for these MF terms are shown in
Figure 1. Retention of a generic 'transcription factor activity'
term does not make sense in the MF ontology because, from an
MF viewpoint, it is equivalent to an otherwise unknown MF
that regulates transcription. However, the BP term 'tran- 
scription, DNA dependent' can be used to annotate all 
gene products that regulate transcription, even when the 
mechanism of action is not known. 
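
As an illustration of the relationships described above, the following minimal Python sketch stores the two links quoted in the text as (subject, relation, object) triples and looks them up. The triple layout and the helper name `related_terms` are assumptions made for this example; the snippet covers only the terms named above, not the full ontology.

```python
# Tiny excerpt of the GO structure, limited to the terms quoted in the text.
GO_NAMES = {
    "GO:0000981": "sequence-specific DNA binding RNA polymerase II "
                  "transcription factor activity",
    "GO:0000977": "RNA polymerase II regulatory region "
                  "sequence-specific DNA binding",
    "GO:0006357": "regulation of transcription from RNA polymerase II promoter",
}

# (subject, relation, object) triples; the layout is an assumption of this sketch.
RELATIONS = [
    ("GO:0000981", "has_part", "GO:0000977"),  # MF term has_part a binding MF term
    ("GO:0000981", "part_of", "GO:0006357"),   # MF term part_of a regulation BP term
]


def related_terms(go_id, relation):
    """Return the IDs linked to go_id by the given relation (hypothetical helper)."""
    return [obj for subj, rel, obj in RELATIONS
            if subj == go_id and rel == relation]


for relation in ("has_part", "part_of"):
    for target in related_terms("GO:0000981", relation):
        print("GO:0000981", relation, target, "-", GO_NAMES[target])
```
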

Today (GO database release on 16 February 2013), the 
GOC provides annotations that allow for identification of 
~300 human, mouse and rat DbTFs, which is ~15% of the 
expected DbTFs (10). Only ~200 of these are presently sup- 
ported by experimental evidence, whereas ~100 are anno- 
tated with evidence based on computational prediction, 
sequence and structure similarity or author statement 
(GO database release on 16 February 2013). There are sev- 
eral mammalian DbTF databases, including TFcat (15), 
JASPAR (16) and TFe (17), that also hold experimentally 
documented DbTF information based on cited scientific lit- 
erature. However, these databases lack informative anno- 
tations founded on ontologies and evidence codes (as 
provided by the GOC), which are necessary for rigorous 
computational reasoning and analysis. 

The above findings suggest that, to date, no single com- 
prehensive information resource for mammalian DbTFs 
exists with the level of coverage and high-quality annota- 
tion that is needed for genome-scale data analysis and in- 
terpretation. The GOC has standard procedures for 
annotating proteins, and their database is authoritative in 
providing comprehensive annotations to the myriad of 
tools that use GO information for data analysis. However, 
the capacity of expert curators at the GOC is presently not 
scaled for or focused for dedicated efforts to comprehen- 
sively annotate one particular functional protein class. 
Therefore, we have embarked on a collaborative effort 
involving community users and GOC members to exhaust- 
ively curate experimentally documented mammalian 
DbTFs. Similar to other sub-domain annotation initiatives 
(18, 19), our first aim was to develop specific guidelines 
for curating experimentally documented DbTFs from litera- 
ture. This included the assembly of a list of experimental 
assays that would qualify to provide verifiable functional 
evidence for genuine DbTFs. Here, we provide a detailed 
report in the form of a comprehensive curation protocol, 
based on which we are currently engaged in a focused 
effort to curate all experimentally characterized DbTFs 
from a collection of candidate proteins compiled from the 
major TF information sources. A database providing 
detailed information about TF information sources and 
assembled DbTF documentation is available at
www.tfcheckpoint.org.

Creation of annotations for sequence-specific DNA binding RNAPII Transcription Factors (DbTFs)

Our curation guidelines for high-quality annotation of 
experimentally verified DbTFs are designed to capture the 
essential functional capabilities of DbTFs and record pub- 
lished evidence using rigorous semantics. In the following 
sections, we describe fundamental functional characteris- 
tics of a DbTF, how these characteristics can adequately 
be described by GO terms and how these terms and 
evidence codes can be asserted based on experimental 
work reported in literature. The assembled procedure 
facilitates a precise representation of DbTF functional attri- 
butes using the standard GOC-defined gene-association file 
format (GAF2.0; http://www.geneontology.org/GO.format.gaf-2_0.shtml)
and the PSI-MI data exchange format used
for recording interaction data (20). A detailed DbTF anno- 
tation guideline document is provided in Supplementary 
Material 2. 

Criteria that qualify a DbTF 

A DbTF is a DNA binding transcription factor that binds to a 
specific DNA sequence and regulates the transcription of 
the associated gene. The specific DNA sequences bound 
by DbTFs are termed TFBS, and for RNAPII these are located 
in gene regulatory regions either upstream and proximal to 
the core promoter, or in more distal upstream or down- 
stream enhancer regions. Once a DbTF recognizes a TFBS, 
it may recruit other accessory factors or RNAPII, or it may 
interfere with binding of other regulatory proteins to regu- 
late the expression of the TG. This means that a DbTF must 
exhibit both DNA-binding and transcription regulation cap- 
acity. Therefore, the minimum criteria to qualify a protein 
as DbTF for RNAPII are that it (i) binds to specific DNA 
sequences in gene regulatory regions and (ii) is involved 
in RNAPII-dependent regulation of transcription. 

It is evident that to capture these functional aspects 
accurately and efficiently, the specific GO terms that sub- 
stantiate these assertions need to be precisely defined. 
These GO terms must address both 'sequence-specific 
DNA binding' and 'transcription regulation' capabilities 
accurately. In the following sections, we provide a detailed 
reasoning behind the selection of specific GO terms of dif- 
ferent granularity as well as assignment of GO evidence 
codes and experimental assays that are considered 
adequate and necessary for creating a DbTF annotation. 

GO terms used for DbTF annotation 

Specific DNA binding. To capture the capability of a 
protein to bind to specific DNA sequences, a GO MF term 
that describes 'sequence-specific DNA binding' (e.g. 
GO:0043565 'sequence-specific DNA binding') should be 
used. GO:0000976 'transcription regulatory region se- 
quence-specific DNA binding' should be used when it is 
not possible to identify information stating that the regu- 
latory region containing the DNA sequence specifically 
bound by the DbTF is part of a gene regulated by RNAP 
II. Where a gene is known to be transcribed by RNAP II, a 
more specific term (GO:0000977 'RNA polymerase II regula- 
tory region sequence-specific DNA binding') may be 
applied. If information exists that indicates whether the 
protein binds the proximal or distal regulatory regions, this 
may be indicated by use of either of the terms describing 
the location of binding (GO:0000978 'RNA polymerase 
II proximal region sequence-specific DNA binding' or 
GO:0000980 'RNA polymerase II distal enhancer sequence- 
specific DNA binding') (Figure 1A, terms shaded yellow). 
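
The term selection described above can be sketched as a small decision function. This is a minimal Python illustration, not part of the curation tooling: the function name `select_dna_binding_term` and its parameters are hypothetical, and the GO identifiers are the ones given in the text.

```python
from typing import Optional


def select_dna_binding_term(rnap_ii_target_known: bool,
                            region: Optional[str] = None) -> str:
    """Pick the GO MF term for a sequence-specific DNA binding annotation.

    rnap_ii_target_known: the paper states that the bound regulatory region
        belongs to a gene transcribed by RNAP II.
    region: "proximal", "distal", or None when the paper does not say.
    """
    if not rnap_ii_target_known:
        # RNAP II involvement cannot be established, so the more general
        # regulatory-region term is used regardless of region.
        return "GO:0000976"
    if region is None:
        return "GO:0000977"
    if region == "proximal":
        return "GO:0000978"
    if region == "distal":
        return "GO:0000980"
    raise ValueError(f"unknown region: {region!r}")


print(select_dna_binding_term(True, "distal"))
```
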

Transcription regulation. The involvement of a 
protein in transcription regulation is well captured by the 
GO BP terms GO:0006357 (regulation of transcription 
from RNAP II promoter) or any of its children that 
specify whether the protein is involved in positive or 
negative regulation of transcription (Figure 1A, terms 
shaded blue). 

Sequence-specific DNA binding RNAP II transcrip- 
tion factor activity. The goal of this curation project is 
to assign a sequence-specific DbTF activity term, i.e. 
GO:0000981 (sequence-specific DNA binding RNAP II tran- 
scription factor activity) or one of its children to appropri- 
ate DbTFs (Figure 1A, terms shaded green). As indicated 
above, this requires that the composite functional aspects 
of DbTF proteins — specific DNA binding and transcription 
regulation — must each be represented by their proper MF 
and BP GO terms. These different aspects of DbTF activity — 
specific DNA binding and involvement in transcriptional 
regulation — are typically demonstrated in different experi- 
ments, sometimes not even presented in the same paper, so 
the annotations to specific DNA binding (MF) and transcrip- 
tional regulation (BP) terms are made separately, and only 
when both are assigned (each in their inherent logic of the 
GO-structure) can they be combined to infer DbTF activity 
MF terms (Table 1). 

The child terms of 'GO:0000981' are used to delineate
whether the TF exerts its activity by binding to the 
promoter proximal region or the distal enhancer, i.e. 
'GO:0000982 (RNAP II core promoter proximal region 
sequence-specific DbTF activity)' or 'GO:0003705 
(sequence-specific distal enhancer binding RNAP II tran- 
scription factor activity)', and whether the result of binding 
is positive or negative regulation of TG transcription, e.g. 
'GO:0001077 (RNAP II core promoter proximal region se- 
quence-specific DbTF activity involved in positive regulation 
of transcription)' and 'GO:0001205 (RNAP II distal enhancer 
sequence-specific DbTF activity involved in positive regula- 
tion of transcription)'. Genes that have been shown to have 
both positive and negative regulatory roles should be 
annotated with both the positive and negative regulation 
terms as appropriate. 

TF binding and TF binding TF activity. Transcriptional 
regulation mechanisms are complex. Usually many TFs work 
together in concert to regulate transcription. In instances 
where the activity of a TF is reported to be dependent on 
interaction with another protein or multi-subunit complex, 
the protein-protein interaction (PPI) is annotated using 
'transcription factor binding' MF GO terms as shown in 
Figure 1B (terms shaded yellow). Furthermore, a different 
set of 'transcription factor activity' terms, i.e. 'GO:0001076 
(RNAP II transcription factor binding transcription factor 
activity)' or any of its children, is chosen to reflect the fact 
that the activity is dependent on binding to another TF 
(Figure 1B, terms shaded green). 

Once TF binding and transcription regulation are each 
annotated individually, the GO structure allows for the 
generation of TF binding TF activity annotations by combin- 
ing the separate annotations (Table 1). 

When the functional unit of a TF is a complex 

In instances where the complex is a homodimer or higher 
order multimer of the same protein, there are no special 
annotation issues, as all of the activities demonstrated are 
properties of the same gene product. However, when the 
functional unit is a heterodimer or other multisubunit com- 
plex, then there are some additional considerations for 
annotation. 

The 'contributes_to' qualifier is specifically intended for 
the annotation of functions that occur in the context of 
complexes, rather than being an activity of a single subunit 
within the complex. In the case of a heterodimer, there are 
times where one of the two proteins does not bind DNA on 
its own. However, in some cases, a subunit that does not 
bind DNA independently can be shown to contribute to the 
sequence specificity of binding when present within a het- 
erodimer. In this situation, the subunit that does not bind 
DNA alone could be annotated to appropriate 'sequence- 
specific DNA binding' terms (Figure 1A, terms shaded 
yellow) using the qualifier 'contributes_to' to indicate 
that it contributes to the DNA binding of the heterodimer. 
More generally, the 'contributes_to' qualifier can be used 
in conjunction with any MF term, including the 'transcrip- 
tion factor activity' terms, to indicate that it contributes to 
that function within the context of a complex, even though 
it does not possess that activity independently. In contrast, 
in a multisubunit TF where the DNA binding activity is 
known to be confined to one or more specific subunits, 
other subunits should not be annotated to a 'specific 
DNA binding' term at all. 
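
The three cases in the preceding paragraph can be summarized in a short Python sketch. The function name and its boolean inputs are hypothetical; the empty string stands for "annotate without a qualifier", and `None` stands for "do not annotate to a specific DNA binding term".

```python
from typing import Optional


def dna_binding_qualifier(binds_dna_alone: bool,
                          contributes_to_specificity_in_complex: bool
                          ) -> Optional[str]:
    """Return the qualifier for a 'sequence-specific DNA binding' annotation.

    ""               -> annotate the subunit directly, no qualifier
    "contributes_to" -> annotate with the 'contributes_to' qualifier
    None             -> make no specific DNA binding annotation
    """
    if binds_dna_alone:
        return ""
    if contributes_to_specificity_in_complex:
        return "contributes_to"
    return None


print(dna_binding_qualifier(False, True))
```
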

For any subunit within a TF complex, it is appropriate 
to annotate all appropriate GO terms for which that 
function has been experimentally shown, either individu- 
ally or as part of the complex indicated with the 
'contributes_to' qualifier. Thus, in some cases, a given pro- 
tein may be annotated both with a 'sequence-specific DNA 
binding RNAP II transcription factor activity' term as well as 
with a 'TF binding RNAP II transcription factor activity' 
term. 

Table 1. Inference of transcription factor activity terms from DNA/TF binding and transcription regulation terms 
Each transcription factor activity term (green) is determined by the composite annotation of the corresponding DNA binding or TF 
binding term (yellow) and a transcription regulation term (blue). 



Evidence codes and experimental assays 

In accordance with the overall guidelines for GO annota- 
tions, each DbTF annotation must be qualified with an 
evidence code indicating how the annotation is supported 
by experimental evidence (http://www.geneontology.org/ 
GO.evidence.shtml). The DbTF curation guidelines pre- 
sented in the current work use one of the following GO 
evidence codes: Inferred from Direct Assay (IDA), Inferred 
from Physical Interaction (IPI), Inferred from Mutant
Phenotype (IMP) or Inferred by Curator (IC). 



When a single scientific paper comprises all experimental 
evidence necessary to support each of the annotations for 
'DNA- or TF-binding' and 'transcription regulation', the evidence
codes for these two annotations are transferred to
the composite DbTF annotation to a MF 'transcription 
factor activity' term (see Table 2). However, when the 
two different types of annotations ('DNA- or TF-binding'
and 'transcription regulation') for a given TF cannot be
generated from one single paper, the evidence code IC
is used along with the GOC-generated reference,
GO_REF:0000036
(http://www.geneontology.org/cgi-bin/references.cgi#GO_REF:0000036).
The IC code, which
requires the use of the two GO IDs for the appropriate
'binding' and 'transcription regulation' terms, indicates
that GO annotations based on evidence from two different
sources have been combined by a curator to infer the
appropriate transcription factor activity term.

Table 2. Evidence code table
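
The evidence-code rule for composite annotations can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Annotation` record, the function name, the use of one output record per distinct transferred evidence code, and the pipe separator between the two supporting GO IDs are all choices made for this example. The PubMed identifiers in the usage lines are placeholders, not real references.

```python
from dataclasses import dataclass
from typing import List

IC_REFERENCE = "GO_REF:0000036"  # GOC-generated reference named in the text


@dataclass(frozen=True)
class Annotation:
    """Hypothetical minimal annotation record used only in this sketch."""
    go_id: str
    evidence: str        # "IDA", "IPI", "IMP" or "IC"
    reference: str       # e.g. a PubMed identifier
    with_from: str = ""  # supporting GO IDs, filled in for IC


def infer_activity_annotations(binding: Annotation,
                               regulation: Annotation,
                               activity_go_id: str) -> List[Annotation]:
    """Combine a binding and a regulation annotation into activity annotations."""
    if binding.reference == regulation.reference:
        # Same paper: transfer the evidence codes of the two annotations
        # (assumption: one composite record per distinct code).
        codes = sorted({binding.evidence, regulation.evidence})
        return [Annotation(activity_go_id, code, binding.reference)
                for code in codes]
    # Different papers: curator inference that cites both supporting GO IDs
    # (assumption: the two IDs are joined with a pipe).
    return [Annotation(activity_go_id, "IC", IC_REFERENCE,
                       with_from=f"{binding.go_id}|{regulation.go_id}")]


binding = Annotation("GO:0000977", "IDA", "PMID:0000001")     # placeholder PMID
regulation = Annotation("GO:0006357", "IMP", "PMID:0000002")  # placeholder PMID
print(infer_activity_annotations(binding, regulation, "GO:0000981"))
```
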

To provide for a uniform standard for evaluation of 
experimental evidence for DbTF annotations, we surveyed 
several relevant resources defining experimental assays 
that can document TF function, including ORegAnno (21), 
TRRD (22), RegulonDB (23) and the PSI-MI controlled vo- 
cabulary for molecular interactions (20). 

In the following sections, we have compiled sets of se- 
lected experimental assays that we deem to be most relevant 
for annotation of DNA binding, TF binding and transcription 
regulation. PSI-MI-unique identifiers are given wherever 
they exist. Augmentation of the PSI-MI vocabulary to span 
a larger repertoire of TF-defining experiments is ongoing. 

Specific DNA binding. Experimental data documenting 
specific DNA binding are obtained from experiments that 
show in vitro binding of a TF to specific DNA sequences 
present in either cloned TG regulatory regions (proximal 
promoter and/or distal enhancer) or in synthetic DNA 
sequences representing canonical TF binding sites or spe- 
cific TG regulatory regions (see Table 3). We have chosen 
not to rely on assays measuring in vivo TF-DNA interaction 
(e.g. the Chromatin ImmunoPrecipitation assay) because it 
is not possible to ascertain in these assays that the TF in 
question actually binds directly to DNA, or whether some 
other component in the in vivo system mediates the 
TF-DNA association. 

The in vitro assay that has been most frequently used for 
documenting sequence-specific binding of TF is the 
Electrophoretic Mobility Shift Assay (EMSA) (24). The most 
common variants of this assay present the TF in the form of: 

(i) nuclear extract from native tissue or cells 

(ii) nuclear extracts from cells or tissue with ectopic 
expression of a TF 



(iii) purified TF (in vitro translated or purified from cell 
extract) 

(iv) nuclear extract from cells with ectopic expression of a 
mutated TF 

(v) purified mutated TF (in vitro translated or purified 
from cell extract). 

When the TF is presented in any of the variants (ii-v), the 
EMSA qualifies for annotation of a GO term for 'specific 
DNA binding'. In the case where the TF is presented as a 
nuclear extract from native cells or tissue (i), we require 
that the specific TF is identified with an additional experi- 
mental approach. This may involve specific competition 
experiments demonstrating that the EMSA gel shift is not 
abolished by competition with an unlabeled DNA probe 
with a point mutation in a known TFBS for this specific 
TF, whereas competition with unlabeled DNA probe 
containing the wild-type TFBS does abolish the gel shift. 
Also, the use of a TF-specific antibody, i.e. EMSA supershift, 
will increase confidence in EMSA assays with nuclear ex- 
tracts from native tissue or cells; however, these assays 
must be interpreted with caution, as the DNA-protein com- 
plex may be shifted even though a different protein than 
the one recognized by the antibody provides for the DNA- 
binding part in the complex. If no additional experimental 
verification of the TF is reported, nuclear extract-based 
EMSAs of type (i) do not suffice to qualify DNA binding 
properties of a TF, and the experiment needs to be 
dismissed. 
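
The EMSA rules above can be expressed as a short lookup. In this minimal Python sketch, the variant labels "i" to "v" follow the list in the text and the evidence codes for variants (ii) to (v) follow Table 3. The code returned for a native-extract EMSA with independent TF identification is an assumption, chosen by analogy with the supershift row of Table 3, because the text does not state it explicitly.

```python
from typing import Optional

# Evidence codes for EMSA variants (ii)-(v), as listed in Table 3.
EMSA_EVIDENCE = {"ii": "IDA", "iii": "IDA", "iv": "IMP", "v": "IMP"}


def emsa_evidence_code(variant: str,
                       tf_independently_identified: bool = False
                       ) -> Optional[str]:
    """Return the GO evidence code for an EMSA, or None if it must be dismissed."""
    if variant in EMSA_EVIDENCE:
        return EMSA_EVIDENCE[variant]
    if variant == "i":
        # Native nuclear extract: only usable when the TF is identified by an
        # additional approach (specific competition or supershift).
        # ASSUMPTION: such a verified experiment is recorded as IDA.
        return "IDA" if tf_independently_identified else None
    raise ValueError(f"unknown EMSA variant: {variant!r}")


print(emsa_evidence_code("i"), emsa_evidence_code("iv"))
```
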

Similarly, the other assays listed in Table 3 must have 
been performed in a manner that provides for identifica- 
tion of the specific TF tested and to assess specific inter- 
action between this TF and a specified DNA probe. For 
MI:0114 X-ray crystallography, to qualify as experimental
evidence of a TF's DNA binding, it is required that the protein
is co-crystallized with a DNA sequence that represents
either a canonical TFBS or an authentic gene regulatory 
region. 

Transcription regulation. The 'transcription regula- 
tion' terms need support from assays that document modu- 
lation of transcriptional process in response to TF action. 
These assays mainly fall into two groups: either reporter 
gene assays measuring the transcriptional regulatory 
effect of a TF on a regulatory region cloned upstream of 
a reporter gene (e.g. luciferase, beta-galactosidase or chlor- 
amphenicol acetyltransferase), or measurement of expres- 
sion levels of a TG mRNA (see Table 4). Within each of the 
assays, a variety of experimental strategies can allow for 
the identification of the specific TF [e.g. 'knock in' (ectopic 
expression) and/or 'knock down']. Furthermore, the gene 
regulatory region can be presented and assessed in differ- 
ent ways in the reporter gene assays (e.g. 'canonical TFBS' 
or 'authentic TG promoter/enhancer') and different 
methods used to assay mRNA expression levels of specific
TGs. The combinations of different modes of TF and TG
detection together define the GO evidence codes to be
used (Table 4).

Table 3. Assays documenting specific DNA binding

| Experimental assay | Variant | Evidence code | PSI-MI code |
|---|---|---|---|
| EMSA | Nuclear extract from native tissue or cells | No evidence | MI:0413 |
| EMSA | Nuclear extracts from cells or tissue with ectopic expression of a TF | IDA | MI:0413 |
| EMSA | Purified TF (in vitro translated or purified from cell extract) | IDA | MI:0413 |
| EMSA | Nuclear extract from cells with ectopic expression of a mutated TF | IMP | MI:0413 |
| EMSA | Purified mutated TF (in vitro translated or purified from cell extract) | IMP | MI:0413 |
| Electrophoretic mobility supershift assay (EMSA supershift) | Nuclear extract from native tissue or cells | IDA | MI:0412 |
| Electrophoretic mobility supershift assay (EMSA supershift) | Nuclear extracts from cells or tissue with ectopic expression of a TF | IDA | MI:0412 |
| Electrophoretic mobility supershift assay (EMSA supershift) | Purified TF (in vitro translated or purified from cell extract) | IDA | MI:0412 |
| Electrophoretic mobility supershift assay (EMSA supershift) | Nuclear extract from cells with ectopic expression of a mutated TF | IMP | MI:0412 |
| Electrophoretic mobility supershift assay (EMSA supershift) | Purified mutated TF (in vitro translated or purified from cell extract) | IMP | MI:0412 |
| Footprinting | | IDA | MI:0417 |
| DNase I footprinting (DNA footprint) | | IDA | MI:0606 |
| Methylation interference assay (MIC) | | IDA | MI:1189 |
| Ultraviolet (UV) footprinting (UV-footprint) | | IDA | MI:1191 |
| Dimethylsulphate footprinting (DMS-footprint) | | IDA | MI:0603 |
| Hydroxy radical footprinting (Hydroxy-footprint) | | IDA | MI:1190 |
| Potassium permanganate footprinting (KMnO4-footprint) | | IDA | MI:0604 |
| Affinity chromatography technology | | IDA | MI:0004 |
| Pull down | | IDA | MI:0096 |
| Southwestern blot assay (SW-blot) | | IDA | |
| In vitro evolution of nucleic acids (SELEX) | | IDA | MI:0657 |
| X-ray crystallography | | IDA | MI:0114 |

The experimental assays are denoted with their standard nomenclature in PSI-MI; for the detailed description please see: http://www.ebi.ac.uk/ontologylookup/browse.do?ontName=MI. For Southwestern blot assay, see: http://www.nlm.nih.gov/mesh/.

Although the experimental assays depicted in Table 4 
are most often carried out by transfecting expression and 
reporter plasmids into cell line model systems, transcription 
regulation annotations can also be supported by whole or- 
ganism experiments, e.g. knock out mutations or 
RNAi knock down strategies. However, as such experiments 
do not by themselves prove a role in regulation of tran- 
scription, such annotations must be made with caution 
and will depend on a strict awareness of additional 
information such as the concomitant documentation of 
specific binding by the protein in question, to regulatory 
regions of an RNAP II regulated gene (e.g. by Chromatin 
ImmunoPrecipitation assay analysis). 
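
Table 4 can be read as a lookup in which the evidence code depends on how the TF was manipulated, while the readout only needs to be one of the listed assays. The following minimal Python sketch encodes that reading; the key names are hypothetical labels for the rows and columns of Table 4, and whole-organism experiments, which the text says require additional caution, are deliberately not covered.

```python
# Rows of Table 4: mode of TF identification -> GO evidence code.
TF_IDENTIFICATION_EVIDENCE = {
    "wt_tf_overexpression": "IDA",
    "mut_tf_overexpression": "IMP",
    "tf_knock_down": "IMP",  # RNAi or antisense RNA
}

# Columns of Table 4: accepted readouts (hypothetical labels).
READOUTS = frozenset({
    "reporter:canonical_tfbs",
    "reporter:authentic_tg_promoter",
    "reporter:authentic_tg_promoter_tfbs_point_mutation",
    "reporter:authentic_tg_promoter_deletion_mutations",
    "expression:primer_specific_pcr",
    "expression:northern_blot",
    "expression:ribonuclease_protection_assay",
})


def regulation_evidence_code(tf_identification: str, readout: str) -> str:
    """Return the evidence code for a transcription regulation annotation."""
    if readout not in READOUTS:
        raise ValueError(f"readout not listed in Table 4: {readout!r}")
    if tf_identification not in TF_IDENTIFICATION_EVIDENCE:
        raise ValueError(f"unknown TF identification mode: {tf_identification!r}")
    return TF_IDENTIFICATION_EVIDENCE[tf_identification]


print(regulation_evidence_code("tf_knock_down", "expression:northern_blot"))
```
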



TF binding. 'TF binding' specific terms are based on any 
assay that provides evidence for PPIs. Table 5 lists experi- 
mental assays and evidence codes that are eligible for TF 
binding specific terms. Currently, we are only making these 
annotations from 'small scale' papers that we come across 
for proteins that are also DbTFs or for proteins with se- 
quence similarity to DbTFs but which do not appear to 
bind DNA. Any future extension of this work to use high 
throughput PPI data would need to carefully consider what 
standards should be applied to minimize the effect of the 
high level of false-positives in high throughput PPI data. 

Table 4. Reporter gene-based assay variants documenting transcription regulation

| TF identification | Reporter gene assay: canonical TFBS | Reporter gene assay: authentic TG promoter | Reporter gene assay: authentic TG promoter with TFBS point mutation | Reporter gene assay: authentic TG promoter with deletion mutations | TG expression assay: primer specific PCR (e.g. RT-PCR, qRT-PCR) | TG expression assay: northern blot | TG expression assay: ribonuclease protection assay |
|---|---|---|---|---|---|---|---|
| wt TF overexpression | IDA | IDA | IDA | IDA | IDA | IDA | IDA |
| mut TF overexpression | IMP | IMP | IMP | IMP | IMP | IMP | IMP |
| TF knock down (RNAi/antisense RNA) | IMP | IMP | IMP | IMP | IMP | IMP | IMP |

This table is a decision matrix for selecting GO evidence codes based on the method used for TF identification (purple) and transcription
regulation (green). wt = wild type; mut = mutated.

Table 5. Assays documenting TF binding

| Assay | Evidence code | PSI-MI code |
|---|---|---|
| 2-Hybrid interactions | IPI | MI:0018 |
| Co-purification | IPI, IDA | MI:0004 |
| Co-immunoprecipitation | IPI, IDA | MI:0019 |

The experimental assays are denoted with their standard nomenclature
in PSI-MI; for the detailed description please see:
http://www.ebi.ac.uk/ontologylookup/browse.do?ontName=MI.

Annotating TGs

An obvious important biological property of a TF lies in the
particular TGs that it regulates. Proper recording of this
information is of key importance for the building of gene
regulatory networks. In studies of DbTF functionality, often
one or several specific TGs will be identified and experimen- 
tally documented. The GOC has introduced an Annotation 
Extension field to capture additional information that 
provides more biological context to the GO annotation 
(GAF 2.0, http://www.geneontology.org/GO.format.gaf-2_ 
O.shtml). This field can be used to record information 
regarding specific TGs regulated by the TF that is being 
annotated. The TG is recorded in the Annotation 
Extension field for the BP transcription regulation GO 
term using the 'has_regulation_target' relationship com- 
bined with the gene identifier(s) for the TG(s). 
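
As an illustration of how a TG can be recorded, the following minimal Python sketch selects an evidence code from the Table 4 decision matrix and writes one GAF 2.0 line with a 'has_regulation_target' entry in the Annotation Extension column. The function names, the method keys and the target identifier `UniProtKB:HYPOTHETICAL_TG` are assumptions made for this example only; the remaining values are borrowed from the example annotations in Figure 3, and the choice of TF identification method for that example is illustrative rather than taken from the cited paper. The sketch also assumes that independent targets are separated by a pipe character.

```python
# Evidence code per TF identification method, read from Table 4; the same
# code applies to every reporter gene and TG expression assay listed there.
EVIDENCE_BY_TF_METHOD = {
    "wt_overexpression": "IDA",
    "mut_overexpression": "IMP",
    "knock_down": "IMP",  # RNAi / antisense RNA
}


def annotation_extension(target_gene_ids):
    """Format TG identifiers with the 'has_regulation_target' relationship.

    Assumption: independent targets are separated by '|'.
    """
    return "|".join(f"has_regulation_target({gid})" for gid in target_gene_ids)


def gaf_line(obj_id, symbol, go_id, reference, tf_method, taxon, date,
             assigned_by, target_gene_ids=()):
    """Return one tab-separated GAF 2.0 line (17 columns) for a BP annotation."""
    columns = [
        "UniProtKB",                            # 1  DB
        obj_id,                                 # 2  DB Object ID
        symbol,                                 # 3  DB Object Symbol
        "",                                     # 4  Qualifier
        go_id,                                  # 5  GO ID
        reference,                              # 6  DB:Reference
        EVIDENCE_BY_TF_METHOD[tf_method],       # 7  Evidence Code
        "",                                     # 8  With/From
        "P",                                    # 9  Aspect (biological process)
        "",                                     # 10 DB Object Name
        "",                                     # 11 DB Object Synonym
        "protein",                              # 12 DB Object Type
        f"taxon:{taxon}",                       # 13 Taxon
        date,                                   # 14 Date (YYYYMMDD)
        assigned_by,                            # 15 Assigned By
        annotation_extension(target_gene_ids),  # 16 Annotation Extension
        "",                                     # 17 Gene Product Form ID
    ]
    return "\t".join(columns)


# Hypothetical example: the target gene identifier is a placeholder.
line = gaf_line("O35739", "Klf9", "GO:0045944", "PMID:9858544",
                "wt_overexpression", 10090, "20130412", "NTNU_SB",
                target_gene_ids=["UniProtKB:HYPOTHETICAL_TG"])
print(line)
```

Keeping the evidence-code lookup separate from the line formatting mirrors the two decisions a curator makes: which code the assay supports, and which TG the annotation refers to.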

Workflow of annotation

The annotation workflow is depicted in Figure 2. An annotation
effort typically starts with one of the scientific papers
suggested in databases such as TFCat and JASPAR to document a
candidate DbTF, or with a search for adequate literature in
one of the following resources: UniProt
(http://www.uniprot.org/), NCBI's Entrez Gene (25), iHOP (26),
GeneCards (27) or NCBI's PubMed
(http://www.ncbi.nlm.nih.gov/pubmed/). Each scientific paper is
first checked for information that allows correct
identification of the species origin of the TF studied.
Because we are focusing on DbTFs from human, mouse and rat
studies, only papers allowing identification of a DbTF from
one of these species proceed to further curation. Thus, a
number of papers that fail to clearly identify the species of
the gene(s) used in their construct(s) have to be omitted from
the curation process. Then, the paper is searched for adequate
experimental evidence to support one or several DbTF
annotations. If either TF species origin or sufficient
experimental evidence is not identifiable, the curator returns
to the scientific literature corpus to search for other
suitable papers. When both criteria are fulfilled, the
individual GO annotations (i.e. specific DNA binding and/or TF
binding and transcription regulation) are assigned together
with a supporting evidence code. Finally, the composite TF
activity MF GO term(s) is inferred. TF annotation data are
submitted to UniProt-GOA in the form of a gene association file
(GAF 2.0; http://www.geneontology.org/GO.format.gaf-2_0.shtml)
and will subsequently appear in the GOC database via tools
such as AmiGO (http://amigo.geneontology.org/) and QuickGO
(http://www.ebi.ac.uk/QuickGO/; Figure 3).
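
The decision steps of this workflow can be sketched in a few lines of Python. This is a simplified illustration and not the curation tooling itself: the `Paper` record, the `curate` function and the single evidence code per paper are assumptions made for the example, TF binding annotations are left out, and the lookup table of composite terms contains only the four term combinations that appear in the example annotations of Figure 3.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SPECIES_IN_SCOPE = {"human", "mouse", "rat"}

# (DNA binding MF term, transcription regulation BP term) -> composite
# TF activity MF term, limited to the combinations shown in Figure 3.
COMPOSITE_TERM = {
    ("GO:0000977", "GO:0000122"): "GO:0001227",
    ("GO:0000977", "GO:0045944"): "GO:0001228",
    ("GO:0000978", "GO:0045944"): "GO:0001077",
    ("GO:0000980", "GO:0045944"): "GO:0001205",
}


@dataclass
class Paper:
    """Hypothetical summary of what a curator extracted from one paper."""
    pmid: str
    species: Optional[str]          # None if the species origin is unclear
    binding_term: Optional[str]     # DNA binding MF term, if supported
    regulation_term: Optional[str]  # regulation BP term, if supported
    evidence: str = "IDA"           # assumption: one code for all terms


def curate(paper: Paper) -> List[Tuple[str, str]]:
    """Return (GO ID, evidence code) pairs, or [] if the paper is set aside."""
    # Step 1: the species origin of the TF must be identifiable and in scope.
    if paper.species not in SPECIES_IN_SCOPE:
        return []
    # Step 2: experimental evidence must support the individual annotations.
    if paper.binding_term is None or paper.regulation_term is None:
        return []
    annotations = [
        (paper.binding_term, paper.evidence),
        (paper.regulation_term, paper.evidence),
    ]
    # Step 3: infer the composite TF activity term when the pair is known.
    composite = COMPOSITE_TERM.get((paper.binding_term, paper.regulation_term))
    if composite is not None:
        annotations.append((composite, paper.evidence))
    return annotations


accepted = curate(Paper("PMID:9858544", "mouse", "GO:0000977", "GO:0045944"))
rejected = curate(Paper("PMID:0000000", None, "GO:0000977", "GO:0045944"))
print(accepted)
print(rejected)
```

An empty result corresponds to the curator returning to the literature corpus to look for another paper.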

Discussion

Benefits of a focused annotation project 

A comprehensive resource of high-quality annotations of TFs is
of high value both for small-scale experiments, where it is
important to select an optimal subset of relevant TFs, and for
genome-scale studies. In the latter case, access to extensive
background knowledge for TFs is essential to infer gene
regulatory networks (28) or to design experiments to
characterize this group of proteins as a functional class in a
system-wide approach (29, 30).

Figure 2. Sequence-specific DNA binding TF (DbTF) curation
workflow. This workflow represents the step-by-step procedure
for curating experimentally verified mammalian DbTFs from
scientific publications. Selection of a scientific publication
from the literature corpus is the starting point of the
curation procedure. From each relevant publication,
DbTF-specific GO terms are annotated and recorded.

Compilation and in-depth analysis of available information on
transcription factors indicate that >800 mammalian DbTFs are
experimentally documented in the scientific literature
(www.tfcheckpoint.org). The current work aims to provide the
foundation to curate this source of information and to record
adequate GO annotations in compliance with the standards
defined here. Currently (GO database release on 16 February
2013), only 202 human, mouse and rat proteins are annotated as
DbTFs with 'GO:0000981 sequence-specific DNA binding RNA
polymerase II transcription factor activity' (or any of its
child terms) supported by experimental evidence, meaning that
some 600 DbTFs still need to be processed. We aim to complete
this task before the end of 2013. Even though the number of
curators involved is small, the efficiency of this focused
annotation project is high, as the number of different GO
terms and evidence codes is limited and well defined, thus
allowing each curator to process a relatively high number of
scientific papers (typically five papers or more per working
day).

Added value of rigorous classification of experimental 
assay requirements for the annotations 

The catalogue of experimental assays that qualify for sup- 
porting TF annotations presented here is assembled based 
on the extensive TF annotation experience in the collabor- 
ating organizations. This aspect of the annotation proced- 
ure improves the quality of the GO annotations, as it 
provides a uniform standard for interpretation of evidence 
strength in published experimental work. As some of the 
assays presently are not adequately covered by PSI-MI 
vocabulary (20), a part of our efforts has been directed to 
collaborate with the PSI-MI consortium to develop add- 
itional PSI-MI terms. The proper documentation of experi- 
mental evidence for each TF annotation will enable us to 
work toward submitting annotated data to the IntAct data- 
base (31). Moreover, we plan to make the experimental 
assay details for the TF annotations available to users via 
our TF database (http://www.tfcheckpoint.org/). This will 
enable users to select subsets of TFs based on the specific 
experimental methods used to characterize them. 
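
To make the idea of method-based selection concrete, the sketch below filters annotation records by the PSI-MI codes listed in Table 5. It is only an illustration under stated assumptions: the record layout, the `select_tfs` function and the TF names `TF_A` to `TF_C` are hypothetical and do not describe the actual TFcheckpoint interface or its data.

```python
# Table 5: assay -> (GO evidence codes, PSI-MI code)
TF_BINDING_ASSAYS = {
    "2-hybrid interactions": (("IPI",), "MI:0018"),
    "co-purification": (("IPI", "IDA"), "MI:0004"),
    "co-immunoprecipitation": (("IPI", "IDA"), "MI:0019"),
}


def select_tfs(records, assay):
    """Return the sorted names of TFs characterized with the given assay."""
    _, psi_mi_code = TF_BINDING_ASSAYS[assay]
    return sorted({r["tf"] for r in records if r["psi_mi"] == psi_mi_code})


# Hypothetical annotation records, for illustration only.
records = [
    {"tf": "TF_A", "psi_mi": "MI:0019"},
    {"tf": "TF_B", "psi_mi": "MI:0018"},
    {"tf": "TF_C", "psi_mi": "MI:0019"},
    {"tf": "TF_A", "psi_mi": "MI:0004"},
]

print(select_tfs(records, "co-immunoprecipitation"))
```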



Concluding remarks 



Metadata are rarely presented in biomedical publications using
formalized knowledge representation. This often makes it
difficult for a curator to extract accurate information for
annotation with ontologies or structured vocabularies from the
natural language used in the literature. The GOC provides
guidelines for the curation of gene product information from
scientific publications and procedures for identifying the
type of evidence that supports the curated information.
Because of these standardized conventions, literature-curated
data in the GO database are deemed to be of high utility and
quality. In the present work, we have established a
comprehensive and specific curation procedure for TFs of RNAP
II, which, similar to other data standardization initiatives,
provides details on the requirements to properly record an
experimentally verified DbTF.

| UniProtKB ID | Symbol | GO identifier | GO term name | Aspect | Evidence | Reference | Taxon |
|---|---|---|---|---|---|---|---|
| O35738 | Klf12 | GO:0000122 | negative regulation of transcription from RNA polymerase II promoter | P | IDA | PMID:9858544 | 10090 |
| O35738 | Klf12 | GO:0000977 | RNA polymerase II regulatory region sequence-specific DNA binding | F | IDA | PMID:9858544 | 10090 |
| O35738 | Klf12 | GO:0001227 | RNA polymerase II transcription regulatory region sequence-specific DNA binding transcription factor activity involved in negative regulation of transcription | F | IDA | PMID:9858544 | 10090 |
| O35739 | Klf9 | GO:0000977 | RNA polymerase II regulatory region sequence-specific DNA binding | F | IDA | PMID:9858544 | 10090 |
| O35739 | Klf9 | GO:0001228 | RNA polymerase II transcription regulatory region sequence-specific DNA binding transcription factor activity involved in positive regulation of transcription | F | IDA | PMID:9858544 | 10090 |
| O35739 | Klf9 | GO:0045944 | positive regulation of transcription from RNA polymerase II promoter | P | IDA | PMID:9858544 | 10090 |
| O43248 | HOXC11 | GO:0000978 | RNA polymerase II core promoter proximal region sequence-specific DNA binding | F | IDA | PMID:9582375 | 9606 |
| O43248 | HOXC11 | GO:0001077 | RNA polymerase II core promoter proximal region sequence-specific DNA binding transcription factor activity involved in positive regulation of transcription | F | IDA | PMID:9582375 | 9606 |
| O43248 | HOXC11 | GO:0045944 | positive regulation of transcription from RNA polymerase II promoter | P | IDA | PMID:9582375 | 9606 |
| P09079 | Hoxb5 | GO:0000980 | RNA polymerase II distal enhancer sequence-specific DNA binding | F | IDA | PMID:12897140 | 10090 |
| P09079 | Hoxb5 | GO:0001205 | RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity involved in positive regulation of transcription | F | IDA | PMID:12897140 | 10090 |
| P09079 | Hoxb5 | GO:0045944 | positive regulation of transcription from RNA polymerase II promoter | P | IDA | PMID:12897140 | 10090 |

All annotations shown are dated 20130412 and assigned by NTNU_SB.

Figure 3. UniProt-GOA screenshot of some of the DbTF
annotations. The annotations generated using the DbTF curation
guidelines discussed here can be accessed from the GO database
using the QuickGO tool.



The GOC is centrally involved in efforts to provide annotation
guidelines for particular protein functional categories.
However, the elaboration of procedures for specific tasks,
such as the curation of distinct functional categories of
proteins or of BP subdomains, is enhanced when experts in the
respective fields are involved in the curation process.
Moreover, the active participation of domain experts is
greatly facilitated by generating detailed curation guidelines
as vehicles for productive interactions. With the
transcription factor curation effort presented here, we wish
not only to provide a greater number of high-quality
annotations for DbTFs and their TGs across three mammalian
species, but also to exemplify the constructive use of
detailed guidelines to facilitate collaborative biocuration
efforts across institutions.